Deriving an Ambiguous Word's Part-of-Speech Distribution from Unannotated Text

نویسنده

  • Reinhard Rapp
چکیده

A distributional method for part-of-speech induction is presented which, in contrast to most previous work, determines the part-of-speech distribution of syntactically ambiguous words without explicitly tagging the underlying text corpus. This is achieved by assuming that the word pair consisting of the left and right neighbor of a particular token is characteristic of the part of speech at this position, and by clustering the neighbor pairs on the basis of their middle words as observed in a large corpus. The results obtained in this way are evaluated by comparing them to the part-of-speech distributions as found in the manually tagged Brown corpus.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Unsupervised Learning on an Approximate Corpus

Unsupervised learning techniques can take advantage of large amounts of unannotated text, but the largest text corpus (the Web) is not easy to use in its full form. Instead, we have statistics about this corpus in the form of n-gram counts (Brants and Franz, 2006). While n-gram counts do not directly provide sentences, a distribution over sentences can be estimated from them in the same way tha...

متن کامل

Considering a resource-light approach to learning verb valencies

Here we describe work on learning the subcategories of verbs in a morphologically rich language using only minimal linguistic resources. Our goal is to learn verb subcategorizations for Quechua, an under-resourced morphologically rich language, from an unannotated corpus. We compare results from applying this approach to an unannotated Arabic corpus with those achieved by processing the same te...

متن کامل

An Approach to Speed-up the Word Sense Disambiguation Procedure through Sense Filtering

In this paper, we are going to focus on speed up of the Word Sense Disambiguation procedure by filtering the relevant senses of an ambiguous word through Part-of-Speech Tagging. First, this proposed approach performs the Part-of-Speech Tagging operation before the disambiguation procedure using Bigram approximation. As a result, the exact Part-of-Speech of the ambiguous word at a particular tex...

متن کامل

Tagging an Unfamiliar Text With Minimal Human Supervision

In this paper, we will discuss a method for tagging an unannotated text corpus whose structure is completely unknown, with a little bit of help from an informant. Starting from scratch, automated and semi-automated methods are employed to build a part of speech tagger for the text. There are three steps to building the tagger: uncovering a set of part of speech tags, discovering for each word i...

متن کامل

Tagging Unknown Words with Raw Text Features

Processing unknown words is disproportionately important because of their high information content. It is crucial in domains with specialist vocabularies where relevant training material is scarce, for example: biological text. Unknown word processing often begins with Part of Speech (POS) tagging, where accuracy is typically 10% worse than on known words. We demonstrate that features extracted...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2007